Injecting Uncertainty in Graphs for Identity Obfuscation 



Paolo Boldi Francesco Bonchi Aristides Gionis Tamir Tassa 

Universita degli Studi Yahoo! Research The Open University 
Milano, Italy Barcelona, Spain Ra'anana, Israel 

boldi@dsi.unimi.it {bonchi,gionis}@yahoo-inc.com tamirta@openu.ac.il 



ABSTRACT 

Data collected nowadays by social-networking applications 
create fascinating opportunities for building novel services, 
as well as expanding our understanding about social struc- 
tures and their dynamics. Unfortunately, publishing social- 
network graphs is considered an ill-advised practice due to 
privacy concerns. To alleviate this problem, several anony- 
mization methods have been proposed, aiming at reducing 
the risk of a privacy breach on the published data, while still 
allowing analysts to study them and draw relevant conclusions. 

In this paper we introduce a new anonymization approach 
that is based on injecting uncertainty in social graphs and 
publishing the resulting uncertain graphs. While existing ap- 
proaches obfuscate graph data by adding or removing edges 
entirely, we propose using a finer-grained perturbation that 
adds or removes edges partially: this way we can achieve the 
same desired level of obfuscation with smaller changes in the 
data, thus maintaining higher utility. Our experiments on 
real-world networks confirm that at the same level of iden- 
tity obfuscation our method provides higher usefulness than 
existing randomized methods that publish standard graphs. 

1. INTRODUCTION 

Preserving the anonymity of individuals when publishing 
social-network data is a challenging problem that has re- 
cently attracted a lot of attention [2, 22]. The methods that 
have been proposed so far for anonymizing social graphs can 
be classified into three main categories: (1) methods that 
group vertices into super-vertices of size at least k, where k 
is the required level of anonymity; (2) methods that provide 
anonymity in the graph via deterministic edge additions or 
deletions; and (3) methods that add noise to the data in the 
form of random additions, deletions or switching of edges. 

In this paper we introduce a new graph-anonymization
method that does not fall into any of the above three
categories. Our method injects uncertainty in the existence of
the edges of the graph and publishes the resulting uncertain
graph, that is, a graph where each edge e has an associated
probability p(e) of being present. Injecting a limited amount
of uncertainty in the data, in order to reach a desired level
of identity obfuscation, is a natural approach [1]. For
instance, the k-anonymity framework for relational data [25,
28] is typically based on injecting uncertainty by means of
attribute generalization; for example, generalizing an exact
numerical value to a range of values.

Figure 1: (a) A graph; (b) a possible obfuscation.

In the context of graph anonymization, our approach can 
be seen as a generalization of random-perturbation methods, 
which randomly delete existing edges and add non-existing 
edges [12]. From a probabilistic perspective, adding a non- 
existing edge e corresponds to changing its probability p(e) 
from 0 to 1, while removing an existing edge corresponds to 
changing its probability from 1 to 0. In our method, instead 
of considering only binary edge probabilities, we allow prob- 
abilities to take any value in [0, 1], thus allowing for greater 
flexibility. The underlying intuition is that by using finer- 
grained perturbation operations, one can achieve the same 
desired level of obfuscation with smaller changes in the data, 
thus maintaining higher data utility. 

An example of the proposed obfuscation method is shown
in Figure 1: The graph (a) is the original graph that needs
to be obfuscated; the published graph (b) is a possible
obfuscation. While vertices v_1 and v_2 are connected in (a), in
(b) they are connected with probability p(v_1, v_2) = 0.7,
representing a reduction of 0.3 in the certainty of existence of
the edge (v_1, v_2). Vertices v_3 and v_4, which are connected in
(a), are no longer connected in the published graph (b), i.e.,
p(v_3, v_4) = 0. Vertices v_2 and v_3, which were not connected
in (a), are connected with probability 0.8 in (b),
corresponding to a partial creation of an edge.

A natural question that arises is how to query and analyze
data that is published in the form of an uncertain graph.
Hence, in order to prove the practical relevance of our
proposal, we need to show not only that the uncertain graph
maintains high utility, which we measure as similarity to
the original graph in terms of characteristic properties, but
also that the computation of these properties can be carried
out efficiently. An essential part of our discussion will be
devoted to this. Fortunately, an increasing research effort
has been dedicated in recent years to the topic of querying and
mining uncertain graphs [14, 15, 24, 36, 37, 38]: this body
of research comes to our aid, providing evidence that useful
analysis can be carried out on uncertain graphs.

In this work we make the following contributions:

• We introduce and formalize the idea of injecting
uncertainty in graphs for identity obfuscation. In particular,
we formally define the notion of (k, ε)-obfuscation for
uncertain graphs (Section 3).

• We provide methods for assessing the level of obfuscation
achieved by an uncertain graph with regard to the
degree property (Section 4).

• We introduce our method for injecting uncertainty in a
graph for (k, ε)-obfuscation (Section 5).

• In Section 6, we discuss several graph statistics and
methods to compute them efficiently in uncertain
graphs. These statistics are then used in Section 7 to
assess the utility of the published uncertain graph.

• Our experimental assessment on three large real-world
networks shows that at the same obfuscation levels,
our method maintains higher data utility than existing
random-perturbation methods.

In the next section we review the relevant literature, while 
in Section 8 we conclude the paper and suggest future work. 

2. RELATED WORK 

As we already mentioned, methods for anonymizing so- 
cial networks can be broadly classified into three categories: 
generalization by means of clustering of vertices; determin- 
istic alteration of the graph by edge additions or deletions; 
randomized alteration of the graph by addition, deletion or 
switching of edges. 

In the first category, Hay et al. [10, 11] propose to gen- 
eralize a network by clustering vertices and publishing the 
number of vertices in each partition together with the den- 
sities of edges within and across partitions. Campan and 
Truta [5] study the case in which vertices contain additional 
attributes, e.g., demographic information. They propose to 
cluster the vertices and reveal only the number of intra- and 
inter-cluster edges. The vertex properties are generalized in 
such a way that all vertices in the same cluster have the 
same generalized representation. Tassa and Cohen [29] con- 
sider a similar setting and propose a sequential clustering 
algorithm that issues anonymized graphs with higher utility 
than those issued by the algorithm of Campan and Truta. 

Cormode et al. [7, 8] consider a framework where two sets 
of entities (e.g., patients and drugs) are connected by links 
(e.g., which patient takes which drugs), and each entity is 
also described by a set of attributes. The adversary relies 
upon knowledge of attributes rather than graph structure 
in devising a matching attack. To prevent matching at- 
tacks, their technique masks the mapping between vertices 
in the graph and real-world entities by clustering the ver- 
tices and the corresponding entities into groups. Zheleva 
and Getoor [33] consider the case where there are multiple 
types of edges, one of which is sensitive and should be pro- 
tected. It is assumed that the network is published without 
the sensitive edges and the adversary predicts sensitive edges 
based on the observed non-sensitive edges. 



In the second category of methods, Liu and Terzi [19] 
consider the case that a vertex can be identified by its degree. 
Their algorithms use edge additions and deletions in order 
to make the graph k-degree anonymous, meaning that for
every vertex there are at least k − 1 other vertices with the
same degree. 

Zhou and Pei [34] consider the case that a vertex can be 
identified by its radius-one induced subgraph. Adversar- 
ial knowledge stronger than the degree is also considered 
by Thompson and Yao [30], who assume that the adver- 
sary knows the degrees of the neighbors, the degrees of the 
neighbors of the neighbors, and so forth. Zou et al. [35] and 
Wu et al. [31] assume that the adversary knows the com- 
plete graph, and the location of the vertex in the graph; 
hence, the adversary can always identify a vertex in any 
copy of the graph, unless the graph has other vertices that 
are automorphically-equivalent. 

In the last category of methods, Hay et al. [12] study the 
effectiveness of random perturbations for identity obfusca- 
tion. They concentrate on degree-based re-identification of 
vertices. Given a vertex v in the real network, they quantify
the level of anonymity that is provided for v by the
perturbed graph as (max_u {Pr(v | u)})^(-1), where the maximum
is taken over all vertices u in the released graph and
Pr(v | u) stands for the belief probability that u is the
image of the target vertex v. By performing experimentation
on the Enron dataset, using various values for the number 
h of added and removed edges, they conclude that in order 
to achieve a meaningful level of anonymity for the vertices 
in the graph, h has to be tuned so high that the resulting 
features of the perturbed graph no longer reflect those of the 
original graph. 

Ying et al. [32] compare random-perturbation methods
to the method of k-degree anonymity [19]. They too use
the a-posteriori belief probabilities to quantify the level of
anonymity. Based on experimentation on two modestly-sized
datasets (Enron and Polblogs) they conclude that the
deterministic approach for k-degree anonymity preserves the
graph structure better than random-perturbation methods.

In a more recent study, Bonchi et al. [4] take a differ- 
ent approach, by considering the entropy of the a-posteriori 
belief probability distributions as a measure of identity ob- 
fuscation. The rationale is that while using the a-posteriori 
belief probabilities is a local measure, the entropy is a global 
measure that examines the entire distribution of these belief 
probabilities. Bonchi et al. show that the entropy measure 
is more accurate than the a-posteriori belief probability, in 
the sense that the former distinguishes between situations 
that the latter perceives as equivalent. Moreover, the ob- 
fuscation level quantified by means of the entropy is always 
greater than the one based on a-posteriori belief probabil- 
ities. Finally, by means of a thorough experimentation on 
three large datasets, using several graph statistics and com- 
paring also to Liu and Terzi [19], they demonstrate that 
random perturbation could be used to achieve meaningful 
levels of obfuscation while preserving most of the features of 
the original graph. 

3. OBFUSCATION BY UNCERTAINTY 

Let G = (V, E) be an undirected graph, where V is the set
of vertices and E is the set of edges. We write V_2 to denote
the set of all (n choose 2) unordered pairs of vertices from V,
that is, V_2 = {(v_i, v_j) | 1 ≤ i < j ≤ n}. The goal is to anonymize
the graph G so that the identity of its vertices is obfuscated.
We propose to publish G as an uncertain graph G̃ = (V, p),
formally defined as follows.

Definition 1. Given a graph G = (V, E), an uncertain
graph on the vertices of G is a pair G̃ = (V, p), where
p : V_2 → [0, 1] is a function that assigns probabilities to
unordered pairs of vertices.

The original graph G and the uncertain graph G̃ have the
same set of vertices V. For the sake of clarity, we write
v ∈ G when we speak about a vertex in G, and v ∈ G̃ when
we speak about a vertex in G̃.

Since the mere description of an uncertain graph consists
of |V_2| = n(n − 1)/2 probability values, we propose
to inject uncertainty only into a small subset of pairs of
vertices. Namely, given a graph G, we create a subset E_C ⊆ V_2
of candidate edges, and then we inject uncertainty only into
the pairs of vertices in E_C, while we implicitly assume that
p(u, v) = 0 for all (u, v) ∉ E_C. The size of E_C will be set
so that |E_C| = c|E|, for a small constant c > 1. In Section
5 we describe a strategy for selecting E_C, given G and a
user-defined parameter c.

The uncertain graph G̃ induces a collection of possible
worlds 𝒲(G̃). A possible world W ∈ 𝒲(G̃) is a graph
W = (V, E_W), where E_W ⊆ E_C. The edge probabilities in
the uncertain graph G̃ imply that the probability of W is

$$\Pr(W) = \prod_{e \in E_W} p(e) \cdot \prod_{e \in E_C \setminus E_W} \bigl(1 - p(e)\bigr). \tag{1}$$
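As a minimal sketch of Equation (1), the following Python fragment enumerates the possible worlds of a tiny uncertain graph and evaluates their probabilities. The dictionary `p`, the function names, and the three-pair candidate set are assumptions made for this illustration; in particular, the value 0.9 is hypothetical and only serves to make the example concrete.

```python
from itertools import combinations


def world_probability(p, world_edges):
    """Pr(W) as in Equation (1): multiply p(e) for candidate pairs present
    in the world and 1 - p(e) for candidate pairs absent from it."""
    prob = 1.0
    for e, pe in p.items():
        prob *= pe if e in world_edges else 1.0 - pe
    return prob


def possible_worlds(p):
    """Yield (edge set, probability) for every subset of the candidate pairs."""
    pairs = list(p)
    for size in range(len(pairs) + 1):
        for subset in combinations(pairs, size):
            edges = frozenset(subset)
            yield edges, world_probability(p, edges)


# Hypothetical candidate set E_C with three uncertain pairs.
p = {(1, 2): 0.7, (2, 3): 0.8, (1, 3): 0.9}
worlds = list(possible_worlds(p))
total = sum(prob for _, prob in worlds)
```

With |E_C| candidate pairs there are 2^|E_C| possible worlds, so explicit enumeration is only feasible for toy inputs; it is shown here purely to make the definition tangible.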

Let us consider the knowledge that an adversary may ex- 
tract from such an uncertain graph about a given target 
vertex in G. Following the literature, we assume that the 
adversary knows some vertex property P of his target ver- 
tex [4, 12, 19, 30, 31, 32, 34, 35]. Examples of such prop- 
erties, as discussed in Section 2, are the degree, the degrees 
of the vertex and its neighbors, and the neighborhood sub- 
graph induced by the target vertex and its neighbors. 

Let Ω_P be the domain in which P takes values, e.g., if P
is the degree property then Ω_P = {0, ..., n − 1}. Given an
uncertain graph G̃ and a property P, for each v ∈ G̃ and
ω ∈ Ω_P we define the probability X_v(ω) that v originated
from a vertex in G with property value ω. Specifically,

$$X_v(\omega) = \sum_{W \in \mathcal{W}(\tilde{G})} \Pr(W) \cdot \chi_{v,\omega}(W), \tag{2}$$

where Pr(W) is given in Equation (1), and χ_{v,ω}(W) is a 0-1
variable that indicates whether the vertex v has the property value
ω in the possible world W. In other words, X_v(ω) is the sum
of probabilities of all possible worlds in which the vertex v
has the given property value ω.

The probabilities X_v(ω) may be arranged in an n × |Ω_P|
matrix, where each row corresponds to one vertex v ∈ G̃ and
gives the corresponding probability distribution X_v(ω) over
all possible values ω ∈ Ω_P. The columns of that matrix
are proportional to the probability distributions that
correspond to property values. More precisely, the normalized
column corresponding to property value ω ∈ Ω_P, i.e.,

$$Y_\omega(v) := \frac{X_v(\omega)}{\sum_{u \in V} X_u(\omega)}, \tag{3}$$

is the probability that v is the image in G̃ of a vertex that
had the property ω in G.



X_v(ω)    deg=0    deg=1    deg=2    deg=3
v_1:      0.006    0.092    0.398    0.504
v_2:      0.054    0.348    0.542    0.056
v_3:      0.020    0.260    0.720    0.000
v_4:      0.180    0.740    0.080    0.000

Y_ω(v)    deg=0    deg=1    deg=2    deg=3
v_1:      0.023    0.064    0.229    0.900
v_2:      0.208    0.242    0.311    0.100
v_3:      0.077    0.180    0.414    0.000
v_4:      0.692    0.514    0.046    0.000

Table 1: The matrices X_v(ω) and Y_ω(v) for the uncertain
graph in Figure 1(b) and the degree property.

Example 1. Consider the uncertain graph in Figure 1(b)
and assume the degree property P_1. Table 1 gives the corresponding
matrix X_v(ω), in which each row gives the probability
distribution regarding the degree of the corresponding vertex
in G̃. For instance, the probability that v_1 has degree 2 is
0.7·0.9·(1−0.8) + 0.7·(1−0.9)·0.8 + (1−0.7)·0.9·0.8 = 0.398.

The columns of X_v(ω), after normalizing them, give the
corresponding Y_ω(v) distributions for each value of the degree
(shown also in Table 1). For instance, if we look for a vertex
that has degree 3 in G, it is either v_1, with probability 0.9,
or v_2, with probability 0.1.
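The two matrices of Table 1 can be reproduced with a short Python sketch that applies Equation (2) literally, by summing over all possible worlds, and then normalizes the columns. Only p(v_1, v_2) = 0.7, p(v_2, v_3) = 0.8 and p(v_3, v_4) = 0 are stated in the text; the remaining three probabilities below are assumptions inferred from the entries of Table 1, and all variable names are ours.

```python
from itertools import product

# Edge probabilities for Figure 1(b). The values 0.9, 0.8 and 0.1 are
# ASSUMED (chosen to agree with Table 1); the other three are from the text.
p = {(1, 2): 0.7, (1, 3): 0.9, (1, 4): 0.8,
     (2, 3): 0.8, (2, 4): 0.1, (3, 4): 0.0}
vertices = [1, 2, 3, 4]
pairs = list(p)

# X[v][w]: total probability of the possible worlds in which v has degree w.
X = {v: [0.0] * len(vertices) for v in vertices}
for present in product([0, 1], repeat=len(pairs)):
    prob = 1.0
    degree = {v: 0 for v in vertices}
    for (a, b), bit in zip(pairs, present):
        prob *= p[(a, b)] if bit else 1.0 - p[(a, b)]
        if bit:
            degree[a] += 1
            degree[b] += 1
    for v in vertices:
        X[v][degree[v]] += prob

# Y[w][v]: column w of X, normalized so that it sums to one.
Y = {}
for w in range(len(vertices)):
    col_sum = sum(X[v][w] for v in vertices)
    Y[w] = {v: X[v][w] / col_sum for v in vertices}
```

The exhaustive loop visits 2^6 = 64 worlds, which is fine for this toy graph but not for real networks.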

To further stress the difference between the two probability
distributions, X_v(ω) and Y_ω(v), let us consider an uncertain
graph G̃ in which all edge probabilities are either 0 or 1
(i.e., a certain graph). Let ω be some property value in Ω_P
and assume that P^(-1)(ω) = {v_{i_1}, ..., v_{i_k}} (namely, there are
exactly k vertices with the property ω in the graph). Then,
for all v ∈ P^(-1)(ω), X_v(ω) = 1 (since each of them has the
property ω with certainty) and X_v(ω') = 0 for any other
property value ω' ≠ ω (since any vertex in a certain graph
has exactly one property value). Furthermore, X_v(ω) = 0 for all
v ∉ P^(-1)(ω). Let us now turn to consider the column in the
matrix that corresponds to ω. Then Y_ω(v) = 1/k for each
of the k vertices in P^(-1)(ω) and Y_ω(v) = 0 for all other
vertices, since if we look for a specific vertex in the graph with
property ω and that is the only information that we know
about that sought-after vertex, then it can be any one of the
vertices in P^(-1)(ω) with probability 1/k.

We are ready to define our notion of privacy. 

Definition 2 ((k, ε)-Obfuscation). Let P be a vertex
property, k ≥ 1 be a desired level of obfuscation, and ε ≥ 0
be a tolerance parameter. The uncertain graph G̃ is said to
k-obfuscate a given vertex v ∈ G with respect to P if the
entropy of the distribution Y_{P(v)} over the vertices of G̃ is
greater than or equal to log_2 k:

$$H\bigl(Y_{P(v)}\bigr) \ge \log_2 k.$$

The uncertain graph G̃ is a (k, ε)-obfuscation with respect to
property P if it k-obfuscates at least (1 − ε)n vertices in G
with respect to P.

Namely, given the considered attack scenario, in which the
adversary uses background knowledge of property P of his
target vertex, we wish to lower bound the entropy of the
distribution it induces over the obfuscated graph vertices by
log_2 k (analogously to the privacy goal in k-anonymity). As
for the tolerance parameter ε, it serves the following
purpose. Since degree sequences in typical social networks
have very skewed distributions, trying to obfuscate
some very unique vertices (such as Barack Obama or
CNN on Twitter or Facebook) is on the one hand hopeless,
and on the other hand not necessarily needed: these vertices
do not represent "normal" users, and identifying them does
not disclose anyone's personal information. In fact, as we
will see later, our obfuscation algorithm guarantees that the
ε-fraction of vertices for which the privacy requirement is
not satisfied can be forced to be taken from some specific
sub-population; for example, in the case of degree
obfuscation they are vertices with high degree.

Example 2. Consider again the graph in Figure 1. Vertex
v_1 has degree 3 in the original graph. Thus, in order to
check the level of obfuscation of this vertex in the obfuscated
graph we have to measure the entropy of the column deg = 3
of the matrix Y_ω(v) in Table 1. That entropy is approximately
0.469, which is rather low, meaning that the identity of v_1 is
not obfuscated enough in the uncertain graph in Figure 1(b).
Vertex v_2 has degree 1 in the original graph. The entropy of the
column deg = 1 is ≈ 1.688 > log_2 3. Vertices v_3 and v_4
have degree 2, and the entropy of the corresponding column
is ≈ 1.742 > log_2 3. Therefore, as three out of four
vertices are 3-obfuscated, the graph in Figure 1(b) provides a
(3, 0.25)-obfuscation for the graph in Figure 1(a).
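The check carried out in Example 2 can be sketched in a few lines of Python. The column values are copied from Table 1 and the original degrees from Example 2; the function and variable names are assumptions of this illustration.

```python
import math


def entropy(dist):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(x * math.log2(x) for x in dist if x > 0)


# Columns of Y from Table 1, indexed by degree; entries ordered v1..v4.
Y = {
    0: [0.023, 0.208, 0.077, 0.692],
    1: [0.064, 0.242, 0.180, 0.514],
    2: [0.229, 0.311, 0.414, 0.046],
    3: [0.900, 0.100, 0.000, 0.000],
}
# Degrees in the original graph of Figure 1(a), as stated in Example 2.
original_degree = {"v1": 3, "v2": 1, "v3": 2, "v4": 2}


def non_obfuscated_fraction(k):
    """Fraction of vertices whose column entropy falls below log2(k)."""
    failed = [v for v, d in original_degree.items()
              if entropy(Y[d]) < math.log2(k)]
    return len(failed) / len(original_degree)


eps = non_obfuscated_fraction(3)
```

Here `eps` is the smallest tolerance for which the uncertain graph of Figure 1(b) qualifies as a (3, ε)-obfuscation under the degree property.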

4. QUANTIFYING THE OBFUSCATION 

In this section we describe how to compute the level of
obfuscation with regard to the degree property. When P
is the degree, Ω_P = {0, ..., n − 1}, and, consequently, the
matrix has n rows and n columns. We need to describe how
to compute X_v(ω) for all v ∈ G̃ and ω ∈ {0, ..., n − 1}.
Once the full matrix X_v is given, it is possible to derive the
distributions Y_ω over the vertices of G̃ for all ω ∈ P(G) and
then verify the k-obfuscation property.

Fix v ∈ G̃ and let e_1, ..., e_{n−1} be the n − 1 pairs of vertices
that include v. For each 1 ≤ i ≤ n − 1, e_i is a Bernoulli
random variable that equals 1 with some probability p_i. Letting
d_v be the random variable corresponding to the degree of v,
we have

$$d_v = \sum_{i=1}^{n-1} e_i. \tag{4}$$

Then for each possible degree ω ∈ Ω_P of v, we have
X_v(ω) = Pr(d_v = ω).

Lemma 1. The probability distribution of d_v may be
computed exactly in time O(n^2).

PROOF. Let d_v^ℓ := Σ_{i=1}^{ℓ} e_i denote the partial sum of the
first ℓ Bernoulli random variables. We will show that once
we have the distribution of d_v^{ℓ−1}, we can compute that of d_v^ℓ
in time O(ℓ). Hence, the distribution of d_v = d_v^{n−1} can be
computed in time Σ_{ℓ=1}^{n−1} O(ℓ) = O(n^2). Indeed,

$$\Pr(d_v^{\ell} = j) = \Pr(d_v^{\ell-1} = j-1) \cdot p_\ell + \Pr(d_v^{\ell-1} = j) \cdot (1 - p_\ell).$$

Therefore, computing a single probability in the distribution
of d_v^ℓ takes constant time (given the full distribution of d_v^{ℓ−1}),
and, consequently, computing the entire distribution of d_v^ℓ
over all 0 ≤ j ≤ ℓ takes time O(ℓ). □
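The recurrence used in the proof of Lemma 1 translates directly into a small dynamic program. The sketch below is one possible rendering in Python; the function name is ours, and the three probabilities are the ones appearing in the arithmetic of Example 1 for vertex v_1.

```python
def degree_distribution(probs):
    """Exact distribution of d_v = e_1 + ... + e_m, where the e_i are
    independent Bernoulli variables with success probabilities `probs`.
    Returns a list whose j-th entry is Pr(d_v = j)."""
    dist = [1.0]  # empty sum: Pr(d = 0) = 1
    for p_l in probs:
        nxt = [0.0] * (len(dist) + 1)
        for j in range(len(nxt)):
            # Pr(d^l = j) = Pr(d^{l-1} = j-1) * p_l + Pr(d^{l-1} = j) * (1 - p_l)
            step = dist[j - 1] * p_l if j >= 1 else 0.0
            stay = dist[j] * (1.0 - p_l) if j < len(dist) else 0.0
            nxt[j] = step + stay
        dist = nxt
    return dist


# Probabilities of the three pairs incident to v1 in Example 1.
dist_v1 = degree_distribution([0.7, 0.9, 0.8])
```

Each pass over `probs` costs time linear in the current length of `dist`, which gives the quadratic bound of the lemma.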



It should be noted that since we choose to inject
uncertainty only into a subset E_C of pairs of vertices, the sum in
Equation (4) is taken only over the pairs of vertices in E_C
that include the vertex v. Hence, if d̄ is the average
degree in G, the average number of addends in d_v is d̄c, where
c = |E_C|/|E|.

In cases where the sum in Equation (4) has a large number
of addends, we may adopt an alternative approach. Since
d_v is the sum of independent random variables, it may be
approximated by the normal distribution N(μ, σ^2), where

$$\mu = \sum_{i=1}^{n-1} E(e_i) = \sum_{i=1}^{n-1} p_i, \qquad \sigma^2 = \sum_{i=1}^{n-1} \mathrm{Var}(e_i) = \sum_{i=1}^{n-1} p_i (1 - p_i),$$

as implied by the Central Limit Theorem [16]. (The Central
Limit Theorem becomes effective already for n ≈ 30; for
typical sizes of n in social networks, the normal
approximation becomes very accurate.) Specifically,

$$\Pr(d_v = \omega) \approx \int_{\omega - 1/2}^{\omega + 1/2} \Phi_{\mu,\sigma}(x)\,dx \quad \text{for } \omega \in \Omega_P = \{0, \ldots, n-1\},$$

where

$$\Phi_{\mu,\sigma}(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}. \tag{5}$$
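To get a feeling for the quality of this approximation, the following Python sketch compares it with the exact distribution on a synthetic vertex. The list of 60 edge probabilities is hypothetical, the function names are ours, and the normal integral is evaluated through the error function `math.erf`; the sketch assumes that at least one probability lies strictly between 0 and 1, so that σ > 0.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def approx_degree_distribution(probs):
    """Normal approximation of Pr(d_v = w) for w = 0, ..., len(probs)."""
    mu = sum(probs)
    sigma = math.sqrt(sum(p * (1.0 - p) for p in probs))  # assumed > 0
    return [normal_cdf(w + 0.5, mu, sigma) - normal_cdf(w - 0.5, mu, sigma)
            for w in range(len(probs) + 1)]


def exact_degree_distribution(probs):
    """Exact distribution via the recurrence of Lemma 1."""
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            nxt[j] += mass * (1.0 - p)
            nxt[j + 1] += mass * p
        dist = nxt
    return dist


# Hypothetical vertex with 60 incident candidate pairs.
probs = [0.3] * 40 + [0.8] * 20
approx = approx_degree_distribution(probs)
exact = exact_degree_distribution(probs)
max_err = max(abs(a - e) for a, e in zip(approx, exact))
```

For this input the largest pointwise deviation `max_err` stays well below one percentage point, while the approximation needs only the two moments μ and σ rather than a quadratic-time recurrence.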



5. INJECTING UNCERTAINTY 

In this section we describe our algorithm, which, given a
graph G, a desired level of obfuscation k, and a tolerance
parameter ε, injects a minimal level of uncertainty into the
graph so that it becomes (k, ε)-obfuscated with respect to a
vertex property P.

5.1 Overview 

As discussed in Section 3, we inject uncertainty in the
graph by assigning probabilities to a subset E_C ⊆ V_2 of
pairs of vertices, such that |E_C| = c|E|, for a small constant
parameter c. The selection of E_C is described in Section 5.3.
Once E_C is selected, only the pairs e ∈ E_C
will become uncertain edges in G̃. All other pairs e ∉ E_C
will be certain non-edges, i.e., p(e) = 0. To establish the
uncertainty of each pair e ∈ E_C, we select a random
perturbation r_e ∈ [0, 1]. If e ∈ E, it becomes an uncertain edge in
G̃ with probability p(e) = 1 − r_e; if e ∈ E_C \ E, it becomes
an uncertain edge with probability p(e) = r_e.

In order for the uncertain graph G̃ to preserve the
characteristics of the original graph G, smaller values of the
perturbation parameter r_e should be favored. A natural candidate
for the generating distribution of r_e is the [0, 1]-truncated
normal distribution,

$$R_\sigma(r) := \begin{cases} \dfrac{\Phi_{0,\sigma}(r)}{\int_0^1 \Phi_{0,\sigma}(x)\,dx} & 0 \le r \le 1, \\[2ex] 0 & \text{otherwise}, \end{cases} \tag{6}$$

where Φ_{μ,σ} is the density function of a Gaussian distribution
provided in Equation (5). As the standard deviation σ of
the normal distribution decreases, a greater mass of R_σ will
concentrate near r = 0 and then the amount of injected
uncertainty will be smaller. Thus, small values of σ contribute
towards better maintaining the characteristics of the original
graph, but at the same time they provide lower levels of
obfuscation. Larger values of σ have the opposite effect.
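One simple way to draw from the truncated distribution R_σ is rejection sampling: draw from N(0, σ^2), take the absolute value, and retry whenever the result exceeds 1. This sampling strategy is our assumption, chosen for brevity, and is not claimed to be the one used in the experiments; the function name and the two test values of σ are likewise illustrative.

```python
import random


def sample_perturbation(sigma, rng=random):
    """Draw r from R_sigma, i.e. N(0, sigma^2) restricted to [0, 1].

    |N(0, sigma^2)| has density proportional to Phi_{0,sigma} on r >= 0,
    so rejecting values above 1 leaves exactly the density of Equation (6).
    """
    while True:
        r = abs(rng.gauss(0.0, sigma))
        if r <= 1.0:
            return r


rng = random.Random(0)
small = [sample_perturbation(0.05, rng) for _ in range(2000)]
large = [sample_perturbation(0.5, rng) for _ in range(2000)]
mean_small = sum(small) / len(small)
mean_large = sum(large) / len(large)
```

The empirical means illustrate the trade-off described above: a smaller σ keeps the perturbations close to zero. For very large σ the acceptance rate of this sampler drops, and an inverse-CDF method would be preferable.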

A key feature of our method is to select judiciously the
perturbation r_e for each pair e = (u, v) ∈ E_C, depending
on properties of the vertices u and v. Hence, the random
variable r_e is drawn from R_{σ(e)}, where the parameter σ(e)
depends on the vertices that e connects. The perturbation
will be larger for edges that connect more unique vertices,
which, consequently, require higher levels of uncertainty to
"blend in the crowd," and smaller for edges that connect
more "typical" vertices.

Additionally, in order to prevent identifying pairs e ∈ E_C
that are true edges in G (by turning every pair e ∈ E_C
into an edge if p(e) > 0.5 and into a non-edge otherwise), the
perturbation r_e is drawn from the uniform distribution on
[0, 1], rather than from the distribution R_σ, for a q-fraction
of the pairs e ∈ E_C, with 0 < q < 1.

5.2 Uniqueness Scores of Vertices 

For certain properties of interest, such as degree, the
majority of vertices in real-world graphs are already anonymous
even without random perturbations. The reason is that for
most values of the property P there are many vertices that
have that value. Hence, we aim at controlling the amount of
applied perturbation, so that larger perturbation is added
at vertices that are less anonymized in the original graph. In
particular, we suggest calibrating the perturbation applied
to a pair e = (u, v) ∈ E_C according to the "uniqueness" of
the two vertices u and v with respect to the property P.
Namely, if both P(u) and P(v) are frequent values, then r_e
should be very small; on the other hand, if P(u) and P(v)
are outlier values, then r_e should be higher. We proceed to
explain our method in detail.

Let P : V → Ω_P be a property defined on the set of
vertices V. Further, consider a distance function d between
values in the range Ω_P of P. So, for each pair of values
ω, ω' ∈ Ω_P, a distance d(ω, ω') ≥ 0 is defined. For example,
for the degree property P_1, the distance d is the absolute value
of the difference of two degrees, while for the radius-one
subgraph property (P_3), the distance d is the edit distance
between two subgraphs.

Definition 3. Let P : V → Ω_P be a property on the
set of vertices V of the graph G, let d be a distance
function on Ω_P, and let θ > 0 be a parameter. Then the
θ-commonness of the property value ω ∈ Ω_P is

$$C_\theta(\omega) := \sum_{v \in V} \Phi_{0,\theta}\bigl(d(\omega, P(v))\bigr),$$

while the θ-uniqueness of ω ∈ Ω_P is U_θ(ω) := 1/C_θ(ω).

In the above definition the function Φ is the Gaussian
density given by Equation (5). The commonness of the
property value ω is a measure of how typical the value ω is
among the vertices of the graph. It is obtained as a weighted
sum over the property values ω' = P(v) of the vertices, where
the weight decays exponentially as a function of the distance
between ω and ω'. The uniqueness is the inverse of the commonness.
It should be noted that the commonness and uniqueness are
meaningful only as relative measures, as they allow one to assess
how much more common, or more unique, one property value is
in G than another property value.

Commonness and uniqueness scores depend on the
parameter θ, which determines the decay rate of the
weights as a function of the distance. We set θ = σ, as larger
amounts of uncertainty imply that property values may be
spread over larger domains of Ω_P due to the injected uncertainty.
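For the degree property, where the distance is the absolute difference of two degrees, the scores of Definition 3 can be sketched as follows. The degree sequence and the choice θ = 1 are hypothetical, and the function names are ours.

```python
import math


def gaussian_density(x, mu, sigma):
    """Gaussian density Phi_{mu,sigma}(x) of Equation (5)."""
    return (math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
            / (math.sqrt(2.0 * math.pi) * sigma))


def commonness(omega, values, theta):
    """theta-commonness of omega: sum over all vertices v of
    Phi_{0,theta}(d(omega, P(v))), with d the absolute difference."""
    return sum(gaussian_density(abs(omega - pv), 0.0, theta) for pv in values)


def uniqueness(omega, values, theta):
    """theta-uniqueness: the inverse of the commonness."""
    return 1.0 / commonness(omega, values, theta)


# Hypothetical degree sequence: several low-degree vertices and one hub.
degrees = [1, 1, 2, 2, 2, 3, 3, 4, 50]
theta = 1.0
scores = {d: uniqueness(d, degrees, theta) for d in sorted(set(degrees))}
most_unique = max(scores, key=scores.get)
```

The hub of degree 50 receives the highest uniqueness score, whereas degree 2, which is shared by three vertices and close to several others, receives the lowest.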

5.3 The Obfuscation Algorithm 

Our algorithm for computing a (k, ε)-obfuscation of a
graph with respect to a vertex property P is outlined as
Algorithm 1. Targeting high utility, the algorithm aims
at injecting the minimal amount of uncertainty needed to
achieve the required obfuscation. Computing the minimal
amount of uncertainty is achieved via a binary search on the
value of the uncertainty parameter σ.

Algorithm 1 (k, ε)-obfuscation

Input: G = (V, E), vertex property P, obfuscation level k,
tolerance ε, size multiplier c, and white noise level q.
Output: A (k, ε)-obfuscation G̃ of G with respect to P.
 1: σ_l ← 0
 2: σ_u ← 1
 3: repeat
 4:    (ε̃, G̃) ← GenerateObfuscation(G, σ_u, P, k, ε, c, q)
 5:    if ε̃ = ∞ then σ_u ← 2σ_u
 6: until ε̃ ≠ ∞
 7: G̃_found ← G̃
 8: while σ_l + δ < σ_u do
 9:    σ ← (σ_l + σ_u)/2
10:    (ε̃, G̃) ← GenerateObfuscation(G, σ, P, k, ε, c, q)
11:    if ε̃ = ∞ then σ_l ← σ
12:    else G̃_found ← G̃; σ_u ← σ
13: return G̃_found

The binary-search flow of Algorithm 1 is determined by
the function GenerateObfuscation, which is shown as
Algorithm 2. The function GenerateObfuscation returns a pair
(ε̃, G̃) where ε̃ = ∞ or 0 ≤ ε̃ ≤ ε. In the first case, the
function could not find a (k, ε)-obfuscation with the given
uncertainty parameter. In the latter case, G̃ is a (k, ε̃)-obfuscation
of G with respect to P, and thus also a (k, ε)-obfuscation.

The obfuscation algorithm starts with an initial guess of
an upper bound σ_u, which is iteratively doubled until a
(k, ε)-obfuscated graph is found. Then, the binary-search
process is performed using σ_l = 0 as the lower bound, and
the upper bound σ_u that was found. The binary search
terminates when the search interval is sufficiently short, and the
algorithm outputs the best (k, ε)-obfuscation found (i.e., the
last one that was successfully generated, because it is
the one obtained with the smallest σ).
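The control flow just described can be sketched in Python as follows. The callable `generate` is a stand-in for GenerateObfuscation with all arguments except σ already fixed, the threshold `delta` plays the role of the interval length at which the search stops, and `toy_generate` is a purely hypothetical deterministic substitute used only to exercise the search.

```python
import math


def obfuscate(generate, delta=1e-3):
    """Search for the smallest sigma for which `generate` succeeds.

    `generate(sigma)` must return a pair (eps, graph), where eps equals
    math.inf if no obfuscation was found for that sigma.
    """
    sigma_l, sigma_u = 0.0, 1.0
    eps, graph = generate(sigma_u)
    while math.isinf(eps):              # double the upper bound until success
        sigma_u *= 2.0
        eps, graph = generate(sigma_u)
    found = graph
    while sigma_l + delta < sigma_u:    # binary search on [sigma_l, sigma_u]
        sigma = (sigma_l + sigma_u) / 2.0
        eps, graph = generate(sigma)
        if math.isinf(eps):
            sigma_l = sigma
        else:
            found, sigma_u = graph, sigma
    return found, sigma_u


def toy_generate(sigma):
    """Hypothetical stand-in: succeeds exactly when sigma >= 2.7."""
    if sigma >= 2.7:
        return 0.01, f"graph(sigma={sigma:.4f})"
    return math.inf, None


graph, sigma = obfuscate(toy_generate)
```

With the toy stand-in the search doubles σ_u from 1 to 4 and then narrows the interval down to a value just above 2.7. The real GenerateObfuscation is randomized, so success is not guaranteed to be monotone in σ; the binary search is a heuristic in that setting.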

The function GenerateObfuscation (Algorithm 2) aims at
finding a (k, ε)-obfuscation of G using a given standard
deviation parameter σ. First, it computes the σ-uniqueness
level U_σ(P(v)) for each vertex v ∈ G. The more unique a
vertex is, the harder it is to obfuscate it. Hence, in order to
use the "uncertainty budget" σ in the most efficient way, the
algorithm performs the following two pre-processing steps.

(Line 2): Since it is allowed not to obfuscate ε|V| of the
vertices, the algorithm selects the set H of ⌈(ε/2)|V|⌉ vertices
with largest uniqueness scores, which are the vertices that
would require the largest amount of uncertainty, and
excludes them from the subsequent obfuscation efforts. In later
steps, the algorithm will inject uncertainty only into edges
that are not adjacent to any of the vertices in H. (The
algorithm could also receive H, or part of H, as an input,
instead of fully selecting it on its own.)

(Line 3): The set of vertices not in H will need to be
obfuscated. To obfuscate more unique vertices, higher
uncertainty is necessary. Thus, edges need to be sampled with
higher probability if they are adjacent to unique vertices. In
order to handle this sampling process, our algorithm assigns
a probability Q(v) to every v ∈ V, which is proportional to
the uniqueness level U_σ(P(v)) of v.

After that, the search for a (k, ε)-obfuscation starts: since
the algorithm is randomized and there is a non-zero
probability of failure, t attempts to find a (k, ε)-obfuscation are
performed (Lines 5-22; in our experiments we used t = 5).

Algorithm 2 GenerateObfuscation

Input: G = (V, E), P, k, ε, c, q, and standard deviation σ.
Output: A pair (ε̃, G̃), where G̃ is a (k, ε̃)-obfuscation (with
ε̃ ≤ ε), or ε̃ = ∞ if a (k, ε)-obfuscation was not found.
 1: for all v ∈ V compute the σ-uniqueness U_σ(P(v))
 2: H ← the set of ⌈(ε/2)|V|⌉ vertices with largest U_σ(P(v))
 3: for all v ∈ V do Q(v) ← U_σ(P(v)) / Σ_{u ∈ V} U_σ(P(u))
 4: ε̃ ← ∞
 5: for t times do
 6:    E_C ← E
 7:    repeat
 8:       randomly pick a vertex u ∈ V \ H according to Q
 9:       randomly pick a vertex v ∈ V \ H according to Q
10:       if (u, v) ∈ E then E_C ← E_C \ {(u, v)}
11:       else E_C ← E_C ∪ {(u, v)}
12:    until |E_C| = c|E|
13:    for all e ∈ E_C do
14:       compute σ(e)
15:       draw w uniformly at random from [0, 1]
16:       if w < q
17:       then draw r_e uniformly at random from [0, 1]
18:       else draw r_e from the random distribution R_{σ(e)}
19:       if e ∈ E then p(e) ← 1 − r_e else p(e) ← r_e
20:    ε' ← |{v ∈ V : v is not k-obfuscated by G̃' = (V, p)}| / |V|
21:    if ε' ≤ ε and ε' < ε̃ then ε̃ ← ε'; G̃ ← G̃'
22: return (ε̃, G̃)

Each attempt begins by randomly selecting a subset E_C ⊆
V_2, which will be subjected to uncertainty injection (Lines 6-
12). The set E_C, whose target size is |E_C| = c|E|, is
initialized to be E (Line 6). Then, the algorithm randomly selects
two distinct vertices u and v, according to the probability
distribution Q, such that neither of them is in H (Lines 8-9).
The pair of vertices (u, v) is removed from E_C if it is an
edge, or added to E_C otherwise (Lines 10-11). The process
is repeated until E_C reaches the required size c|E|. Since
in typical graphs the number of non-edges is significantly
larger than the number of edges, i.e., |E| ≪ |V_2|/2, the loop
in Lines 7-12 ends very quickly for small values of c, and
the resulting set E_C includes most of the edges in E.
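
The selection loop of Lines 6-12 can be sketched as follows. This is only an illustrative reading of the pseudocode, not the authors' Java implementation: the function name, the representation of pairs as frozensets, and the rounding of c|E| to an integer are assumptions made here.

```python
import random

def select_candidate_pairs(vertices, edges, Q, H, c, rng=None):
    """Illustrative sketch of Lines 6-12: build the set E_C of vertex pairs."""
    rng = rng or random.Random()
    # Unordered pairs are stored as frozensets, so (u, v) and (v, u) coincide.
    E = {frozenset(e) for e in edges}
    E_C = set(E)                               # Line 6: start from E
    target = int(round(c * len(E)))            # assumed rounding of c|E|
    eligible = [v for v in vertices if v not in H]
    weights = [Q[v] for v in eligible]
    while len(E_C) != target:                  # Line 12: until |E_C| = c|E|
        u, v = rng.choices(eligible, weights=weights, k=2)  # Lines 8-9
        if u == v:
            continue                           # the two vertices must be distinct
        pair = frozenset((u, v))
        if pair in E:
            E_C.discard(pair)                  # Line 10: drop an original edge
        else:
            E_C.add(pair)                      # Line 11: add a non-edge
    return E_C
```

Because |E_C| changes by at most one per iteration, the loop reaches the target size exactly, provided enough non-edges exist among the vertices outside H.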

Next, in Line 14, we redistribute the uncertainty levels
among all pairs e ∈ E_C in proportion to their uniqueness
levels. Specifically, we define for each e = (u, v) ∈ E_C its
σ-uniqueness level,

$$U_\sigma(e) := \frac{U_\sigma(P(u)) + U_\sigma(P(v))}{2}, \qquad (7)$$

and then set

$$\sigma(e) := \sigma \, |E_C| \cdot \frac{U_\sigma(e)}{\sum_{e' \in E_C} U_\sigma(e')},$$

so that the average of σ(e) over all e ∈ E_C equals σ.
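
As a small illustration of this redistribution, the sketch below computes σ(e) for every pair. It assumes that the uniqueness of a pair is the mean of the uniqueness levels of its endpoints and that σ(e) is proportional to it; the function name, the dictionary `U` of vertex uniqueness levels, and the uniform fallback for an all-zero input are hypothetical choices made here.

```python
def edge_uncertainty_levels(E_C, U, sigma):
    """Map each pair e = (u, v) in E_C to sigma(e); the mean over E_C is sigma."""
    # Pair uniqueness: assumed to be the mean of the endpoint uniqueness levels.
    U_e = {e: (U[e[0]] + U[e[1]]) / 2.0 for e in E_C}
    total = sum(U_e.values())
    if total == 0:
        # Degenerate case (assumption): spread the uncertainty uniformly.
        return {e: sigma for e in U_e}
    scale = sigma * len(U_e) / total
    return {e: scale * u for e, u in U_e.items()}
```

For instance, with two pairs whose uniqueness levels are 2 and 2.5 and σ = 0.1, the pairs receive roughly 0.089 and 0.111, whose average is 0.1.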

Given the edge uncertainty levels σ(e), we select for each
pair of vertices e ∈ E_C a random perturbation r_e. For the
majority of the pairs (a (1 − q)-fraction, where the input
parameter q is small) we select r_e from the random
distribution R_σ(e) (see Equation (6)). For the remaining q-fraction
of pairs we select r_e from the uniform distribution on [0, 1].
If e is an actual edge (e ∈ E), it turns into an uncertain
edge of the obfuscated graph with associated probability
p(e) = 1 − r_e. If e is a non-edge in G (e ∈ E_C \ E), it turns
into an uncertain edge of the obfuscated graph with
probability p(e) = r_e (Line 19).
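
The per-pair perturbation of Lines 15-19 can be sketched as below. Equation (6) is not reproduced in this section, so the sketch assumes that R_σ is a zero-mean normal distribution with standard deviation σ restricted to [0, 1]; if Equation (6) defines R_σ differently, `draw_truncated_normal` has to be replaced accordingly. Both function names are hypothetical.

```python
import random

def draw_truncated_normal(sigma, rng):
    """Rejection-sample |N(0, sigma^2)| restricted to [0, 1] (assumed form of R_sigma)."""
    while True:
        r = abs(rng.gauss(0.0, sigma))
        if r <= 1.0:
            return r

def edge_probability(is_edge, sigma_e, q, rng):
    """Return p(e) for one pair e in E_C, following Lines 15-19."""
    w = rng.random()                                # Line 15
    if w < q:
        r_e = rng.random()                          # Line 17: uniform "white noise"
    else:
        r_e = draw_truncated_normal(sigma_e, rng)   # Line 18
    return 1.0 - r_e if is_edge else r_e            # Line 19
```

With a small σ(e), original edges keep a probability close to 1 and added non-edges a probability close to 0, which is what keeps the perturbation fine-grained.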

If the algorithm finds a (k, ε)-obfuscated graph in one of
its t trials, it returns the obfuscated graph with the minimal
tolerance ε̃ ≤ ε, i.e., the one with the smallest fraction of
vertices that are not k-obfuscated. If, on the other hand, all
t attempts fail, the algorithm indicates the failure by
returning ε̃ = ∞.

6. UTILITY OF THE UNCERTAIN GRAPH 

In order to prove the practical relevance of our proposal, 
we need to show that: (1) the uncertain graph maintains 
high utility, i.e., it is highly similar to the original graph in 
terms of characteristic properties; and (2) the computation 
of these properties can be carried out in reasonable time. 

In the rest of this section, we discuss several graph statis- 
tics and show how to compute them in uncertain graphs. 
In our experimental assessment, we use those statistics to 
evaluate the utility of the proposed graph obfuscation. 

Further evidence of the usefulness of publishing an uncertain
graph is provided by the many recent papers on mining
and querying uncertain graphs [14, 15, 24, 36, 37, 38].

6.1 Sampling 

Given a standard (certain) graph G, let S[G] be the value 
of a statistical measure S for G. Examples of such a sta- 
tistical measure S are the average degree, the diameter, the 
clustering coefficient of G, and so on. In order to define the 
value of S in an uncertain graph G = (V, p), the most natural
choice is to consider the expected value of S[G], namely,

$$E(S[G]) = \sum_{W \in \mathcal{W}(G)} \Pr(W) \cdot S[W], \qquad (8)$$

where 𝒲(G) denotes the set of possible worlds of G and Pr(W)
is given in Equation (1). While for some statistics it is
possible to compute the expected value in Equation (8) without
explicitly performing a summation over the exponential number
of possible worlds (as we will see in Section 6.2), for other
statistics such a computation remains infeasible. Hence, we
have to resort to approximation by sampling. Namely, we sample
a subset of possible worlds 𝒲' ⊆ 𝒲(G) according to the
distribution induced by the probabilities Pr(W), and then take
the average S̄ of the statistic S in the sampled worlds as an
approximation of E(S[G]):

$$\bar{S} = \frac{1}{|\mathcal{W}'|} \sum_{W \in \mathcal{W}'} S[W]. \qquad (9)$$

Sampling a possible world according to the distribution
Pr(W) is carried out by independently including each edge e
with probability p(e).
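
The sampling estimator can be sketched in a few lines. The representation of the uncertain graph as a dictionary `p` from vertex pairs to probabilities, and the two function names, are assumptions made for illustration; `statistic` is any function that maps a set of edges to a number.

```python
import random

def sample_world(p, rng):
    """Draw one possible world: keep each pair e independently with probability p[e]."""
    return {e for e, prob in p.items() if rng.random() < prob}

def estimate_statistic(p, statistic, r, rng=None):
    """Average of `statistic` over r independently sampled possible worlds."""
    rng = rng or random.Random()
    return sum(statistic(sample_world(p, rng)) for _ in range(r)) / r
```

For example, `estimate_statistic(p, len, 100)` approximates the expected number of edges, which for this particular statistic can also be computed exactly as the sum of the probabilities.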

The following lemma provides a probabilistic error bound 
for approximating the expected value by an average over a 
number of sampled worlds. 

Lemma 2. Let G = (V, p) be an uncertain graph and assume
that S is a graph statistic that satisfies a ≤ S ≤ b. Let
r = |𝒲'| denote the number of sampled worlds and S̄ be the
average of the statistic S over those worlds, Equation (9).
Then for every ε > 0,

$$\Pr\bigl(|E(S[G]) - \bar{S}| \ge \varepsilon\bigr) \le 2 \exp\left(-\frac{2 \varepsilon^2 r}{(b-a)^2}\right). \qquad (10)$$

PROOF. Let 𝒲' = {W_i}, 1 ≤ i ≤ r, be the set of r graphs that
were sampled from G = (V, p). Then S_i = S[W_i], 1 ≤
i ≤ r, are independent and identically distributed random
variables. Since E(S_i) = E(S[G]) for all 1 ≤ i ≤ r, it follows
that also E(S̄) = E(S[G]). Hence, inequality (10) follows
directly from Hoeffding's inequality [13]. □

Corollary 1. For a given error bound ε and probability of
failure δ, we have Pr(|E(S[G]) − S̄| ≥ ε) ≤ δ, provided that

$$r \ge \frac{1}{2} \left(\frac{b-a}{\varepsilon}\right)^2 \ln\left(\frac{2}{\delta}\right).$$

In the next section, we define a number of scalar and 
vector statistics of interest; when possible, we also provide 
an explicit computation of E(S[G]). 
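
To get a feeling for the sample sizes implied by Corollary 1, the bound can be evaluated directly. The helper below is a hypothetical convenience function that returns the smallest integer r satisfying the bound for a statistic with values in [a, b].

```python
import math

def required_samples(a, b, eps, delta):
    """Smallest r with r >= (1/2) * ((b - a) / eps)^2 * ln(2 / delta)."""
    return math.ceil(0.5 * ((b - a) / eps) ** 2 * math.log(2.0 / delta))
```

For a statistic with values in [0, 1], an error of at most 0.1 with failure probability at most 0.05 requires 185 sampled worlds. The bound scales with (b − a)², so statistics with a wide range need many more samples for the same absolute error.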

6.2 Statistics Based on Degree 

Let d_1, ..., d_n denote the degree sequence in a graph
G. The statistic S is called a degree-based statistic if
S = F(d_1, ..., d_n) for some function F. Examples of such
statistics are:

• Number of edges: S_NE = (1/2) Σ_{v∈V} d_v.

• Average degree: S_AD = (1/n) Σ_{v∈V} d_v.

• Maximal degree: S_MD = max_{v∈V} d_v.

• Degree variance:¹ S_DV = (1/n) Σ_{v∈V} (d_v − S_AD)².

When G is an uncertain graph, d_1, ..., d_n are random
variables. If F is a linear function, then we have

$$E(S[G]) = E(F(d_1, \ldots, d_n)) = F(E(d_1), \ldots, E(d_n)). \qquad (11)$$

Hence, since the expected degree of a vertex v ∈ V is equal
to the sum of probabilities of its adjacent edges, the
computation of the expected statistic is easy in the case of a
linear function. As the first two examples above, S_NE and
S_AD, correspond to a linear function F, we have:

$$E(S_{NE}[G]) = E\Bigl(\frac{1}{2} \sum_{v \in V} d_v\Bigr) = \frac{1}{2} \sum_{v \in V} \sum_{u \in V \setminus \{v\}} p(u,v) = \sum_{e \in V_2} p(e),$$

and

$$E(S_{AD}[G]) = E\Bigl(\frac{1}{n} \sum_{v \in V} d_v\Bigr) = \frac{1}{n} \sum_{v \in V} \sum_{u \in V \setminus \{v\}} p(u,v) = \frac{2}{n} \sum_{e \in V_2} p(e).$$

Things are less simple when F is non-linear, since then
Equation (11) does not hold. This is the case with the latter
two examples: the maximal degree (F = max) and
the degree variance (F is quadratic). For these statistics we
adopt the sampling approach described in the previous section.
Since the maximal degree is at most n − 1, the statistic
S_MD satisfies Corollary 1 with a = 0 and b = n − 1. Similarly,
the statistic S_DV satisfies Corollary 1 with a = 0 and
b = (n − 1)². It should also be noted that we can compute
E(S_DV[G]) precisely. However, the cost of evaluating the
corresponding formulas, which we omit herein, is quadratic
in the number of vertices.
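
For the two linear statistics, S_NE and S_AD, no sampling is needed: the expected values follow from the edge probabilities alone. The sketch below assumes, purely for illustration, that the uncertain graph is given as a dictionary `p` from unordered vertex pairs to probabilities over n vertices; the function names are hypothetical.

```python
def expected_degrees(p, n):
    """Expected degree of each vertex 0..n-1: sum of probabilities of its pairs."""
    deg = [0.0] * n
    for (u, v), prob in p.items():
        deg[u] += prob
        deg[v] += prob
    return deg

def expected_num_edges(p):
    """E(S_NE): the sum of all edge probabilities."""
    return sum(p.values())

def expected_average_degree(p, n):
    """E(S_AD): twice the sum of all edge probabilities, divided by n."""
    return 2.0 * sum(p.values()) / n
```

Both quantities take time linear in the number of pairs with non-zero probability, in contrast to the non-linear statistics discussed above.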

We proceed to describe two additional statistics that are
based on the degree distribution. In the following we use
A(d), with 0 ≤ d ≤ n − 1, to denote the fraction of vertices
in the graph G that have degree d.



¹ This is one of the measures of graph heterogeneity [27].



The first statistic, denoted by S_PL, is the power-law
exponent of the degree distribution. For this statistic, we
assume that the degree distribution follows a power law,
A(d) ∼ d^(−γ), and S_PL is an estimate of −γ. In our
experiments, we focused on higher degrees where the power law fits
better, and we fitted the exponent ignoring smaller degrees.

The second statistic is the degree distribution itself,
S_DD := (A(0), A(1), ..., A(n − 1)). As opposed to all
previous statistics, which were scalar, this one is a vector. In
fact, each of the previous statistics may be derived from the
degree distribution. To approximate S_DD[G] we adopt once
more the sampling approach: for every degree d, we
approximate A(d) by the average Ā(d) obtained over the sampled
possible worlds.

6.3 Statistics Based on Shortest-path Distance 

Other interesting measures characterizing a graph are 
those based on the shortest-path distance between pairs of 
vertices. Computing distance distributions on large graphs 
is far from trivial, as explained in the survey of Kang et 
al. [17]. While exact solutions using breadth-first search or
Floyd's algorithm are out of the question, there is still no
consensus in the research community on which approximate
technique is best [9]. Some methods are based on sampling, for
example, performing a breadth-first search from a selected
set of vertices [6, 18], and others are based on information
diffusion [3, 17, 23]. While the former are simpler to
implement, diffusion-based techniques have the advantage of
being more general (they are natively designed for directed
graphs, while most sampling methods only work for
undirected ones) and scale more gracefully.

Defining the distance between pairs of vertices in uncer- 
tain graphs is not an easy task since, typically, the cor- 
responding ensemble of possible worlds will include discon- 
nected instances; in such disconnected possible worlds, some 
of the pairwise distances are infinite [24]. We directly avoid
this problem by defining the distance-based measures S only 
on pairs of vertices that are path-connected. 

We consider five measures: 

• Average distance: S_APD is the average distance among
all pairs of vertices that are path-connected.

• Effective diameter: S_EDiam is the 90-th percentile
distance among all path-connected pairs of vertices, i.e.,
the minimal value d for which 90% of the finite pairwise
distances in the graph are no larger than d. In our
experiments, we used the variant that linearly interpolates
between the 90-th percentile and the successive integer.

• Connectivity length: The statistic S_CL is defined as the
harmonic mean of all pairwise distances in the graph,

$$S_{CL} = \left( \frac{1}{n(n-1)/2} \sum_{(u,v) \in V_2} \frac{1}{\mathrm{dist}(u,v)} \right)^{-1}$$

[20]. Note that by taking 1/dist(u, v) = 0 for non
path-connected pairs (u, v), the connectivity length can be
defined as the average over all vertex pairs, independently
of whether they lie in the same connected component.

• Distribution of pairwise distances: S_PDD is the
distribution of pairwise distances in the graph, where S_PDD[k]
is the number of pairs of vertices whose distance equals
k, for 1 ≤ k ≤ n − 1, and S_PDD[∞] is the number of
pairs of vertices that are not path-connected.

• Diameter: S_Diam is the maximum distance among all
path-connected pairs of vertices.

For computing the above measures we rely on sampling.
It is easy to see that Lemma 2 and Corollary 1 hold for each
of those statistics with a = 1 and b = n − 1.
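
On a single sampled world that is small enough, the scalar distance-based measures can be computed exactly with one breadth-first search per vertex. The sketch below is meant only to make the definitions concrete: it is not the HyperANF-based procedure used in the experiments, it uses the plain (non-interpolated) 90-th percentile, and the function names and the vertex labelling 0..n-1 are assumptions.

```python
from collections import deque

def pairwise_distances(n, edges):
    """Finite shortest-path distances of all unordered pairs, via BFS from each vertex."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dists = []
    for s in range(n):
        seen = {s: 0}
        queue = deque([s])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in seen:
                    seen[y] = seen[x] + 1
                    queue.append(y)
        # Count each unordered pair once (only targets with a larger label).
        dists.extend(d for t, d in seen.items() if t > s)
    return dists

def distance_statistics(n, edges):
    d = sorted(pairwise_distances(n, edges))
    if not d:
        return None                            # no path-connected pair at all
    k = (9 * len(d) + 9) // 10                 # ceil(0.9 * number of finite distances)
    all_pairs = n * (n - 1) / 2
    return {
        "APD": sum(d) / len(d),                # average over path-connected pairs
        "EDiam": d[k - 1],                     # non-interpolated 90-th percentile
        "CL": all_pairs / sum(1.0 / x for x in d),  # 1/dist = 0 for unconnected pairs
        "Diam": d[-1],
    }
```

The quadratic cost of this exact computation is what makes approximate, diffusion-based estimation necessary on graphs with hundreds of thousands of vertices.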

To estimate the distance distribution in a given (certain)
graph, we use HyperANF [3], a diffusion-based algorithm
that provides a good tradeoff between accuracy guarantees
and execution time. As the algorithm is probabilistic, the
results that it gives may drift from the real ones, depending
on the number of registers used for the evaluation. Such
drifts affect the variance over the value obtained for each
point of the distance distribution. To limit the effect of such
probabilistic drifts, we repeated the execution of HyperANF
and used jackknifing [26] to infer the standard error of the
statistics that we compute; in our experiments this error
ranges between 0.2% and 2%.

While the HyperANF approach is viable for the first four
statistics described above, it falls short in estimating the
diameter. Exact diameter computation is difficult and even
heuristic methods such as [9] would be too inefficient to be
executed on many sampled worlds. As a result, we focus on
estimating a lower bound S_DiamLB for S_Diam; such a lower
bound is computed as the largest distance t for which the
approximate distance distribution computed by HyperANF
is nonzero; i.e., it is the largest distance t for which we
estimate that there is at least one pair of vertices of distance t
from each other.

6.4 Clustering Coefficient 

The clustering coefficient S_CC measures the extent to
which the edges of the graph "close triangles." More
formally, given a graph G, let T_3[G] be the number of cliques of
size 3 in the graph G, and T_2[G] be the number of connected
triplets. The clustering coefficient S_CC[G] of a graph G is
then defined as S_CC[G] = T_3[G]/T_2[G]. Since T_3[G] ≤ T_2[G],
the clustering coefficient is a number between 0 and 1.

Example 3. Let K_3 be the complete graph on three
vertices. Then T_3[K_3] = 1 and T_2[K_3] = 1. Hence, S_CC[K_3] =
1. Consider next the graph G on three vertices u, v, w with
two edges only, (u, v) and (u, w). Then T_3[G] = 0 and
T_2[G] = 1, whence S_CC[G] = 0.

Given an uncertain graph G, we can estimate the
expected clustering coefficient E(S_CC[G]) by sampling (see
Section 6.1). Since the clustering coefficient takes values
between 0 and 1, we can apply Lemma 2 with a = 0 and
b = 1. Thus, we can estimate E(S_CC[G]) within an error
of at most ε with probability of success at least 1 − δ by
sampling at least r = (1/(2ε²)) ln(2/δ) possible worlds.
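
The clustering coefficient of one sampled world can be computed as in the following sketch. It follows the counting used in Example 3, where a connected triplet is a set of three vertices joined by at least two edges, so that a triangle counts as a single triplet; under this reading T_2 equals the number of length-two paths minus twice the number of triangles. The function name and the vertex labelling 0..n-1 are assumptions.

```python
def clustering_coefficient(n, edges):
    """S_CC = T_3 / T_2 for a certain graph with vertices 0..n-1."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    # Count each triangle {u, v, w} exactly once, with u < v < w.
    triangles = sum(1 for u in range(n) for v in adj[u] if u < v
                    for w in adj[u] & adj[v] if w > v)
    # Paths of length two, counted by their middle vertex.
    wedges = sum(len(a) * (len(a) - 1) // 2 for a in adj)
    # A triangle contains three such paths but is a single connected triplet.
    connected_triplets = wedges - 2 * triangles
    return triangles / connected_triplets if connected_triplets else 0.0
```

On K_3 the function returns 1, and on the two-edge graph of Example 3 it returns 0. Note that this differs from the other common convention, three times the number of triangles divided by the number of length-two paths, although both take values in [0, 1].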

7. EXPERIMENTAL ASSESSMENT 

The objective of our experimental assessment is to show 
that the proposed technique is able to provide the required 
obfuscation levels while maintaining high data utility. In 
particular, we set the following concrete subgoals. For given 
values of k and ε, we want to assess:

1. the level of noise (specified by the value of σ) needed to
achieve (k, ε)-obfuscation;

2. the running time of the obfuscation algorithm;

3. the error in the statistics of the obfuscated graph with
respect to the original graph;

4. how the proposed method compares with
random-perturbation methods for the same levels of obfuscation.



Table 2: Values of σ that yielded a (k, ε)-obfuscation
obtained by Alg. 1. In all cases q = 0.01 and c = 2,
except for the two cases marked (*) where c = 3.

dataset   k     ε = 10⁻³         ε = 10⁻⁴
dblp      20    5.9605 · 10⁻⁸    1.6153 · 10⁻⁵
          60    2.9802 · 10⁻⁷    3.2206 · 10⁻³
          100   1.8775 · 10⁻⁵    1.0711 · 10⁻²
flickr    20    2.2948 · 10⁻⁵    2.6343 · 10⁻²
          60    1.0397 · 10⁻³    7.3275 · 10⁻² (*)
          100   5.8624 · 10⁻³    2.9273 · 10^[illegible] (*)
Y360      20    5.9605 · 10⁻⁸    5.9605 · 10⁻⁸
          60    5.9605 · 10⁻⁸    1.0133 · 10⁻⁶
          100   5.9605 · 10⁻⁸    1.1146 · 10⁻⁵


Table 3: Computation (real) time in edges/sec.

For our experiments, we use three large real-world datasets.
dblp is a co-authorship graph extracted from a recent
snapshot of the DBLP database considering only journal
publications.² Vertices represent authors, and there is an
undirected edge between two authors if they have authored a
journal paper together.

flickr is a popular online community for sharing photos,
with millions of users.³ In addition to many photo-sharing
facilities, users are creating a social network by explicitly
marking other users as their contacts.

Y360: Yahoo! 360 was a social-networking and
personal-communication portal. In the Y360 dataset, edges represent
the friendship relationship among users.

The graph sizes vary from 226 413 vertices for dblp and
588 166 for flickr to 1 226 311 for Y360, with different
densities; Y360 is the largest but also the sparsest dataset. The
main statistics (as defined in Section 6) of the three datasets
are reported in Table 4.

7.1 Parameter Tuning and Running Time 

In our first set of experiments, we considered three
obfuscation levels, k ∈ {20, 60, 100}, and two possible tolerance
values, ε ∈ {10⁻³, 10⁻⁴}. We experimented with different
values for q and c (with q ∈ {0.01, 0.05, 0.1} and c ∈ {2, 3}),
but here we present only the case q = 0.01 and c = 2 (except
for two instances that will be discussed below). In Table 2,
we report the minimal values of σ, as found by Algorithm 1,
that yielded a (k, ε)-obfuscation for given values of k and ε.

As expected, larger k or smaller ε required larger values
of σ, because more noise was needed in order to reach the
desired level of obfuscation. In some cases, Algorithm 1
failed to find a proper upper bound for σ in the loop in
Lines 3-6. In those cases, increasing the parameter c to 3
resolved the problem.

² http://www.informatik.uni-trier.de/~ley/db/
³ http://www.flickr.com/

Table 4: The sample mean on a sample of size 100, with ε = 10⁻⁴. The last column is the average (over all
statistics) of the relative absolute difference between the sample mean and the real value of the statistics.

Table 5: The relative sample standard error of the mean (SEM) on a sample of size 100, with ε = 10⁻⁴ (the
other parameters are set as in Table 2). For every statistic, the value shown is the sample standard deviation,
divided by the square root of the sample size and normalized by the sample mean. The last column is the
average of the relative sample standard errors over all of the statistics.

The obfuscation algorithm was implemented in Java and
run on an Intel Xeon X5660 CPU (2.80 GHz, 12 MB cache).
Table 3 reports the running times (expressed in edges
per second) of the same experiments for which we reported
the values of σ in Table 2. As explained above, we used in
all cases q = 0.01 and c = 2, except for the two cases marked
by (*) in which c = 3. We note that using smaller values of c
has the benefit of keeping the graph size under control; such
a benefit is of special importance for large networks. Smaller
values of c also reduce the runtime of Algorithm 2, where the
main loop (Lines 13-19) is over c|E| edges. This effect is
evident in Table 3, where the performance drops substantially
in the two cases where c = 3. As expected, the performance
slightly decreases when k increases or ε decreases, due to the
increased effort needed to achieve a higher obfuscation level. We
note that the smaller computation times required for Y360
are due to the fact that this dataset turns out to be easier
to obfuscate than the others (as witnessed also by the small
final values of σ reported in Table 2).

The parameter q just introduces some amount of "white
noise" in the graph. Using higher values of q enhances
obfuscation but it also reduces the utility of the final released
graph. Due to space limitations, we present only results for
q = 0.01. A more elaborate set of plots, for different
settings of q and other obfuscation parameters, will be given in
an extended version of this paper.⁴

⁴ A complete set of plots, along with the code of Algorithm 1,
is available at http://boldi.dsi.unimi.it/obfuscation/.



7.2 Data Utility 

Next, we computed statistics of interest on the obfuscated
graphs, using the sampling method (Section 6.1).⁵ For
every obfuscated graph, we sampled 100 possible worlds and
for each of them we computed all the scalar statistics listed
above. The mean values obtained are shown in Table 4.
Those values are very concentrated, as witnessed by
Table 5, which reports the relative sample standard error of the
mean (also called SEM; it is obtained as the sample
standard deviation divided by the square root of the sample size
and normalized by the sample mean); the last column
reports the average computed over all the statistics. On
average, the fluctuations for all statistics are about 3% (last
column of Table 5), but most of them (see, for example,
S_AD) exhibit a much higher level of concentration.
There is a weak dependence on k and also on ε (the latter
dependence is not shown here).

⁵ For S_NE and S_AD we use the exact formulas (Sec. 6.2). The
results are almost identical to those obtained by sampling.

Figure 2: The distribution of pairwise distances S_PDD; the small (red) dots correspond to the distribution in
the real dblp graph; the boxplots give the distributions for the case k = 20, ε = 10⁻³ (left) and k = 100, ε = 10⁻⁴
(right). As usual, the two whiskers represent the smallest and largest values observed across the samples,
whereas the box represents the range between the lower and the upper quartiles.

Figure 3: The distribution of degrees S_DD (horizontal axis: degree); the small (red) dots correspond to the
distribution in the real dblp graph; the boxplots give the distributions for the case k = 20, ε = 10⁻³ (left)
and k = 100, ε = 10⁻⁴ (right).

We proceed to comparing the sample mean of the
statistics obtained with their real values on the original graph (see
again Table 4). The quality of the estimation decreases as
the obfuscation level grows: in the last column of the
table, we report the average statistical error over all scalar
statistics, that is, the relative absolute difference between
the estimate and the real value. With small values of k,
e.g., k = 20, the error is always well below 15%; larger
values of k introduce larger errors, up to 70.5% when k = 100
in the dblp dataset. Observe that some statistics (e.g.,
degree variance or clustering coefficient) are more affected by
error than others.
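
The two summary quantities used in this section, the relative standard error of the mean and the relative absolute error with respect to the real value, are straightforward to compute from the values of a statistic over the sampled worlds. The following sketch uses hypothetical function names and assumes a non-zero sample mean and a non-zero real value.

```python
import math
import statistics

def relative_sem(samples):
    """Sample standard deviation / sqrt(sample size), normalized by the sample mean."""
    mean = statistics.fmean(samples)
    return statistics.stdev(samples) / math.sqrt(len(samples)) / abs(mean)

def relative_error(estimate, real):
    """Relative absolute difference between an estimate and the real value."""
    return abs(estimate - real) / abs(real)
```

Averaging `relative_error` over all scalar statistics of a graph gives the kind of figure reported in the last column of the tables.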

The behavior described for scalar statistics is also
observed with vector statistics. For example, Figure 2 shows
S_PDD (the distribution of the pairwise distances) in the
original dblp and in two obfuscated versions. Here, two extreme
cases are presented: for k = 20 and ε = 10⁻³ the
distribution obtained is qualitatively very similar (as witnessed
also by the scalar distance-based statistics in Table 4);
conversely, for k = 100 and ε = 10⁻⁴, the estimated distribution
is quite far from the original one. In Figure 3 we present a
similar plot for the degree distribution: for every degree, we
considered the distribution of the frequency of that degree
across all possible worlds. In this case, the approximation
is very concentrated and its mean almost coincides with the
real degree frequency, even for k = 100 and ε = 10⁻⁴.

7.3 Comparative Evaluation 

We finally compare our proposed method with
random-perturbation methods that publish a standard graph (in
particular the methods described by Bonchi et al. [4]):

• random sparsification: given a parameter p, each edge
e ∈ E is removed from the graph with probability p;

• random perturbation: given a parameter p, first each
edge e ∈ E is removed from the graph with probability
p, then each non-existing edge in V_2 \ E is added with
probability p|E| / (|V_2| − |E|).
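
For reference, the two baselines can be sketched as below. The addition probability is assumed here to be p|E| / (|V_2| − |E|), the value that keeps the expected number of edges unchanged. The quadratic enumeration of all vertex pairs is only suitable for small graphs, and the function names are hypothetical.

```python
import itertools
import random

def random_sparsification(edges, p, rng):
    """Remove each edge independently with probability p."""
    return {e for e in edges if rng.random() >= p}

def random_perturbation(n, edges, p, rng):
    """Remove each edge with probability p, then add non-edges so that the
    expected number of edges is preserved (assumed addition probability)."""
    E = {tuple(sorted(e)) for e in edges}
    num_non_edges = n * (n - 1) // 2 - len(E)
    q_add = p * len(E) / num_non_edges if num_non_edges else 0.0
    kept = {e for e in E if rng.random() >= p}
    added = {e for e in itertools.combinations(range(n), 2)
             if e not in E and rng.random() < q_add}
    return kept | added
```

Both methods output a certain graph, so every affected edge is changed entirely; the uncertain-graph approach changes edge probabilities only partially instead.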

To make the comparison possible, we must first determine
which value of the parameter p used in these obfuscation
algorithms corresponds to which pair (k, ε) of obfuscation
parameters. The appropriate values can be deduced from the
anonymity level plots of the sparsified or perturbed graph
obtained with a certain value of p: of course, any such graph
will correspond to many pairs of parameters (k, ε); for
example, given any fixed ε, an appropriate k can be determined
by disregarding the εn vertices with smallest anonymity and
letting k be the least anonymity of the remaining vertices.
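
The matching rule just described can be written down directly. In the sketch below, `anonymity_levels` is a hypothetical list holding the anonymity level of every vertex of the sparsified or perturbed graph, and the number of disregarded vertices is taken to be εn rounded down (an assumption about the rounding).

```python
def matched_k(anonymity_levels, eps):
    """Least anonymity level after disregarding the eps * n least-anonymous vertices."""
    levels = sorted(anonymity_levels)
    skip = int(eps * len(levels))          # floor of eps * n (assumed rounding)
    return levels[skip] if skip < len(levels) else None
```

With ten vertices and ε = 0.2, the two least-anonymous vertices are disregarded and k is the third-smallest anonymity level.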

Figure 4 shows the obfuscation levels obtained for some
of the parameter combinations on dblp and flickr. The
plot shows, for every obfuscation level k, the number of
vertices that have obfuscation level less than or equal to k.
The two rectangles appearing in the plot highlight the
obfuscation requirements (k, ε). Figure 4 shows, for example,
that a random perturbation of dblp with p = 0.04 matches
obfuscation (k = 60, ε = 10⁻³).

Figure 4: Comparison of the anonymity levels obtained for dblp (left) and flickr (right) using obfuscation,
random perturbation and sparsification, for the parameter choices described in Section 7.3. The plot shows,
for every obfuscation level k (horizontal axis), the number of vertices that have obfuscation level less than or
equal to k.

Table 6: Comparison between obfuscation by uncertainty and obfuscation by random sparsification and
perturbation.

graph    method                      S_NE        S_AD    S_MD    S_DV    S_PL      S_APD   S_DiamLB  S_EDiam  S_CL    S_CC    rel. err.
dblp     original                    716 460     6.33    238     76.79   −0.046    7.34    25        8.78     6.96    0.38
         rand. pert. (p = 0.04)      716 393     6.33    230     71.26   −0.048    7.09    18.55     7.25     6.85    0.36    0.071
         obf. (k = 60, ε = 10⁻³)     713 819     6.31    236     75.86   −0.046    7.15    22.75     7.21     6.82    0.36    0.043
         rand. spars. (p = 0.64)     257 890     2.28    93      11.40   −0.124    10.24   36.72     10.60    25.77   0.13    0.921
         obf. (k = 20, ε = 10⁻⁴)     713 952     6.31    233     76.18   −0.046    7.01    22.59     7.16     6.68    0.35    0.050
flickr   original                    5 801 442   19.73   6 660   6 200   −0.002    5.03    21        5.43     4.80    0.12
         rand. pert. (p = 0.64)      5 801 229   19.73   2 407   820.3   −0.0059   4.55    7.02      4.15     4.49    0.030   0.497
         rand. spars. (p = 0.32)     3 944 902   13.41   4 526   2 871   −0.003    5.24    19.56     4.91     6.69    0.079   0.286
         obf. (k = 20, ε = 10⁻⁴)     5 921 470   20.14   5 847   6 924   −0.002    4.84    20.51     4.81     4.64    0.050   0.112

We here present the comparative results in the following
cases:⁶

• dblp with random perturbation using p = 0.04,
matching k = 60 and ε ≈ 10⁻³;

• dblp with sparsification using p = 0.64, matching k =
20 and ε ≈ 10⁻⁴;

• flickr with random perturbation using p = 0.64 and
with sparsification using p = 0.32, both corresponding
to k = 20 with ε ≈ 10⁻⁴.

For each of the two obfuscation techniques presented in [4],
we produced 50 samples; note that in those probabilistic 
methods, the obfuscation is a certain graph. Then we com- 
puted the statistics on each sample, and proceeded in the 
same way as we did for the obfuscated graph. 

Table 6 shows the results of the comparison. In all cases,
the statistics computed with our obfuscation method are
closer to their values on the original graph; in one case, the
relative error is 5% instead of the 92% imposed by
sparsification to obtain the same level of obfuscation. Our
experimental assessment on real-world graphs thus supports the
driving intuition underlying this paper: by using finer-grained
perturbation operations, such as only partially perturbing
the existence of an edge, one can achieve the same desired
level of obfuscation with smaller changes in the data than
when completely removing or adding edges, thus maintaining
higher data utility.

⁶ The values of p used here (p ∈ {0.04, 0.32, 0.64}) are the
same as those used by Bonchi et al. [4].



8. CONCLUSIONS AND FUTURE WORK 

We introduce a new approach for identity obfuscation 
in graph data. In the proposed approach, the desired ob- 
fuscation is obtained by injecting uncertainty in the social 
graph and publishing the resulting uncertain graph. Our 
proposal can be seen as a generalization of random per- 
turbation methods for identity obfuscation in graphs, as it 
enables finer-grained perturbations than fully removing or 
fully adding edges. Such increased flexibility in spreading 
the noise over the edges of the graph enables achieving the 
same level of obfuscation with smaller changes in the data, 
as confirmed by our experiments on real-world graphs.

While the results that we achieve are most encouraging,
this work represents only a first step in a promising research
direction. As is often the case, new privacy-enabling
techniques create novel attacks that, in turn, propel stronger
protection mechanisms. Therefore, in our future investiga- 
tion we plan to extend and strengthen this line of research 
by further assessing its limits and merits.

One interesting research direction is to investigate how
to extend our uncertainty-based approach in order to re- 
lease networks with additional information, besides the mere 
graph data, such as vertex attributes [22], communication 
logs among users, information-propagation traces, and other 
types of social dynamics. Another case of particular interest 
is that of a sequential release of a social network. In a recent 
paper, Medforth and Wang [21] demonstrated the risks of 
publishing a sequence of releases of the same network. In 
particular, they described the degree-trail attack, by which 
the vertex belonging to a target user can be re-identified 
from a sequence of published graphs, by comparing the de- 
grees of the vertices in the published graphs with the degree 
evolution of the target. The applicability of the degree-trail 
attack to our probabilistic graph release is an open research 
question.